class: center, middle, inverse, title-slide

# *t* test a special case of regression? Strange but true …

### Peter Geelan-Small - Stats Central, UNSW

### 17/06/2021

---
<style type="text/css">
.remark-slide-content {
  font-size: 28px;
  padding: 1em 1em 1em 1em;
}
</style>

# Background

Why think about the link between the `\(t\)` test and regression?

<img src="data:image/png;base64,#t-tests_and_regression_xar_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---

# Statistical modelling

- Sensible modelling starts with *research questions* and a *statistical analysis plan* written at the start of the study
- Not so useful to ask, "What statistical tests do I need to use?"
- "Flowchart" approach - hides unity behind "different" methods
- Better to ask first, "How can I use the variables in my data to answer my research questions?"
- Focus on relationships between variables
- Helps you see there are not so many "different" methods after all!

---

# Outline

- Data for examples
  - Randomised controlled trial
  - Effect of hormones on preventing heart disease events among older women (Heart and Estrogen/progestin Replacement Study (HERS), Hulley et al. 1998)
  - Data available through Vittinghoff (2012)
- Example models
  - Regression to `\(t\)` tests
  - Model equations
- What makes a linear model *linear*?
- Why there might be fewer statistical methods than you think!

---

# Outline

## Note

- Modelling here does not follow a sensible statistical analysis plan
- Models shown are only examples for today's topic
- Response variables we're looking at are continuous
- For these continuous response variables, a model with a normal distribution assumption makes sense
- Assumptions for all models should be checked - only the first one is done here

---

# Data

Data has 37 variables from 2,763 subjects (seven shown here)

```
## 'data.frame':	2763 obs. of  7 variables:
##  $ HT      : Factor w/ 2 levels "hormone therapy",..: 2 2 1 2 2 1 2 1 1 1 ...
##  $ age     : int  70 62 69 64 65 68 70 69 61 62 ...
##  $ BMI     : num  23.7 28.6 42.5 24.4 21.9 ...
##  $ drinkany: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 2 2 ...
##  $ exercise: Factor w/ 2 levels "no","yes": 1 1 1 1 1 1 1 2 2 1 ...
##  $ diabetes: Factor w/ 2 levels "no","yes": 1 1 2 1 1 1 2 1 1 1 ...
##  $ glucose : int  84 111 114 94 101 116 120 95 105 98 ...
```

- Data for non-diabetics only will be used

---

# Simple linear regression

*Can BMI (body mass index) predict baseline glucose level?*

.pull-left[

- Response: glucose - continuous
- Predictor: BMI - continuous
- Model assuming normal distribution
- Straight-line model

*Model equation*

`$$\small Y = \beta_0 + \beta_1 X_1$$`

]

.pull-right[

<img src="data:image/png;base64,#t-tests_and_regression_xar_files/figure-html/unnamed-chunk-7-1.png" style="display: block; margin: auto;" />

]

---

# Simple linear regression

*Assumptions:* Check constant variance and normal distribution. OK!

.pull-left[

<img src="data:image/png;base64,#t-tests_and_regression_xar_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" />

]

.pull-right[

<img src="data:image/png;base64,#t-tests_and_regression_xar_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" />

]

---

# Simple linear regression

```
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  83.2966     1.1386 73.1555        0
## BMI           0.4841     0.0405 11.9636        0
```

BMI is a useful predictor but the model is not very good (low `\(R^2\)`).

*Fitted model*

`$$\small \mathrm{glucose} = 83.3 + 0.48 \, \mathrm{BMI}$$`

---

# Simple linear regression

.pull-left[

What can we do with our fitted model?

- Predict glucose from BMI
- If BMI = 40, what is the predicted mean glucose level?

`$$\small \mathrm{glucose} = 83.3 + 0.48 \times 40 = 103$$`

]

.pull-right[

<img src="data:image/png;base64,#t-tests_and_regression_xar_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" />

]

---

# What is "linear"?

- "Linear" means straight line. No argument!
- Linear regression model means the right-hand side of the equation is a linear combination of predictors (i.e. pattern is: *parameter1* `\(\times\)` *predictor1* + *parameter2* `\(\times\)` *predictor2* + ...)
- Both these are linear models:

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, \mathrm{BMI}$$`

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, \mathrm{BMI}+ \beta_2 \, \mathrm{BMI}^2$$`

- This is not a linear model:

`$$\small \mathrm{glucose} = \beta_0 \, \mathrm{BMI}^{\, \beta_1}$$`

---

# Multivariable linear regression

*Is age a useful predictor of glucose level as well as BMI?*

<img src="data:image/png;base64,#t-tests_and_regression_xar_files/figure-html/unnamed-chunk-16-1.png" style="display: block; margin: auto;" />

---

# Multivariable linear regression

- Add *age* (continuous variable) to model

*Model equation*

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, \mathrm{BMI} + \beta_2 \, \mathrm{age}$$`

```
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  79.1678     2.5194 31.4234   0.0000
## BMI           0.4949     0.0409 12.1104   0.0000
## age           0.0572     0.0312  1.8369   0.0664
```

*Fitted model*

`$$\small \mathrm{glucose} = 79.2 + 0.50 \, \mathrm{BMI} + 0.06 \, \mathrm{age}$$`

---

# Multivariable linear regression

*Fitted model*

`$$\small \mathrm{glucose} = 79.2 + 0.06 \, \mathrm{age} + 0.50 \, \mathrm{BMI}$$`

- Person who is 60 years old with BMI of 40: glucose = 102 (using the unrounded estimates from the table)

`$$\small \mathrm{glucose} = 79.17 + 0.057 \times 60 + 0.495 \times 40 = 102$$`

---

# Multivariable linear regression

*Fitted model with data*
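
A minimal R sketch of fitting a model of this form and reproducing a prediction by hand. The data here are simulated, not the HERS data: the data frame `sim` and the coefficients used to generate it are assumptions for illustration only, so the printed estimates will not match the slides.

```r
# Simulated stand-in data: NOT the HERS data, only similar in shape
set.seed(1)
n <- 200
sim <- data.frame(
  BMI = rnorm(n, mean = 28, sd = 5),
  age = round(runif(n, min = 50, max = 80))
)
sim$glucose <- 80 + 0.5 * sim$BMI + 0.05 * sim$age + rnorm(n, sd = 9)

# Multivariable linear regression: glucose on BMI and age
fit <- lm(glucose ~ BMI + age, data = sim)
round(coef(summary(fit)), 4)

# Predicted mean glucose for a 60-year-old with a BMI of 40
new_subject <- data.frame(BMI = 40, age = 60)
pred <- unname(predict(fit, newdata = new_subject))

# The same prediction by hand: b0 + b1 * BMI + b2 * age
b <- coef(fit)
by_hand <- unname(b["(Intercept)"] + b["BMI"] * 40 + b["age"] * 60)

# Compare the two predictions
all.equal(pred, by_hand)
```

`predict()` works from the unrounded coefficients, which is why a hand calculation with rounded values can differ slightly in the last digit.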
---

# Multivariable linear regression

*Does exercise affect glucose level while adjusting for BMI?* (ANCOVA)

.center[]

---

# Multivariable linear regression

How do BMI and exercise fit together in their effect on glucose?

.center[]

---

# Multivariable linear regression

*Model equation (parallel lines) - two predictor variables*

- BMI - continuous
- exercise - categorical (no/yes coded as 0/1)

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, \mathrm{BMI} + \beta_2 \, \mathrm{exer}_{\mathrm{yes}}$$`

For exercise = no (i.e. 0):

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, \mathrm{BMI}$$`

For exercise = yes (i.e. 1):

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, \mathrm{BMI} + \beta_2 \;\;\; \mathrm{or} \;\;\; \mathrm{glucose} = (\beta_0 + \beta_2) + \beta_1 \, \mathrm{BMI}$$`

---

# Multivariable linear regression

```
##             Estimate Std. Error t value
## (Intercept)   84.018      1.192  70.470
## BMI            0.471      0.041  11.501
## exerciseyes   -0.865      0.427  -2.025
```

*Fitted model*

`$$\small \mathrm{glucose} = 84.0 + 0.47 \, \mathrm{BMI} - 0.87 \, \mathrm{exer}_{\mathrm{yes}}$$`

For exercise = no (i.e. 0):

`$$\small \mathrm{glucose} = 84.0 + 0.47 \, \mathrm{BMI}$$`

For exercise = yes (i.e. 1):

`$$\small \mathrm{glucose} = 84.0 + 0.47 \, \mathrm{BMI} - 0.87 \;\;\; \mathrm{or} \;\;\; \mathrm{glucose} = 83.2 + 0.47 \, \mathrm{BMI}$$`

---

# Multivariable linear regression

*Fitted model with data*

.center[]

---

# Predictor: categorical, two groups - *t* test

*Does exercise alone affect glucose level?* (*t* test)

.center[]

---

# Predictor: categorical, two groups - *t* test

- Response: glucose
- Predictor: exercise - categorical (no/yes coded as 0/1)

*Model equation (two parameters)*

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, \mathrm{exer}_{\mathrm{yes}}$$`

```
##             Estimate Std. Error t value
## (Intercept)   97.370      0.280 347.561
## exerciseyes   -1.644      0.435  -3.775
```

For exercise = no: `\(\;\;\)` Est. mean glucose = 97.4

For exercise = yes: `\(\;\)` Est. mean glucose = 97.37 - 1.64 = 95.7

---

# Predictor: categorical, two groups - *t* test

```
## 
## 	Two Sample t-test
## 
## data:  glucose by exercise
## t = 3.7755, df = 2027, p-value = 0.0001643
## alternative hypothesis: true difference in means between group no and group yes is not equal to 0
## 95 percent confidence interval:
##  0.789975 2.497762
## sample estimates:
##  mean in group no mean in group yes 
##          97.37006          95.72619
```

`\(t\)` test is just a linear model

- Special case - one categorical predictor variable with two groups
- Same `\(t\)` statistic as for the exercise coefficient in the regression (3.775); the sign differs only because the test takes no minus yes, the regression yes minus no

---

# What's missing? ANOVA!

*Does a subject's physical activity level compared to other women of similar age help to explain glucose level?*

.center[]

---

# What's missing? ANOVA!

- Response: glucose
- Predictor: physical activity - categorical (5 categories)

*Model equation (five parameters, four variables)*

`$$\beta_0 \;\;\; \mathrm{much\_less}$$`

`$$X_\mathrm{less} \;\;\; \mathrm{less - coded \; as \; 0/1}$$`

`$$X_\mathrm{eq} \;\;\; \mathrm{equal - coded \; as \; 0/1}$$`

`$$\mathrm{etc.}$$`

`$$\small \mathrm{glucose} = \beta_0 + \beta_1 \, X_\mathrm{less} + \beta_2 \, X_\mathrm{eq} + \beta_3 \, X_\mathrm{more} + \beta_4 \, X_\mathrm{much\_more}$$`

---

# What's missing? ANOVA!
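
As a rough check on the two ideas above (the `\(t\)` test as a linear model, and 0/1 coding for a factor with more than two groups), here is a sketch in R. The data are simulated, not the HERS data: the data frame `sim`, its sample size and the generating values are assumptions for illustration only. The variable names simply mirror those on the slides.

```r
# Simulated stand-in data: NOT the HERS data; names mirror the slides
set.seed(2)
n <- 300
activity_levels <- c("much_less", "less", "equal", "more", "much_more")
sim <- data.frame(
  exercise   = factor(sample(c("no", "yes"), n, replace = TRUE)),
  physactive = factor(sample(activity_levels, n, replace = TRUE),
                      levels = activity_levels)
)
sim$glucose <- 97 - 1.5 * (sim$exercise == "yes") + rnorm(n, sd = 9)

# Two groups: classical equal-variance t test ...
tt <- t.test(glucose ~ exercise, data = sim, var.equal = TRUE)
# ... and the same comparison as a linear model with one 0/1 predictor
fit2 <- lm(glucose ~ exercise, data = sim)
lm_row <- coef(summary(fit2))["exerciseyes", ]

# t.test() reports mean(no) - mean(yes); lm() reports mean(yes) - mean(no),
# so the two t statistics differ only in sign
same_t <- all.equal(unname(tt$statistic), -unname(lm_row["t value"]))
same_p <- all.equal(tt$p.value, unname(lm_row["Pr(>|t|)"]))

# Five groups: the first level (much_less) is the reference, absorbed
# into the intercept; the other four levels each get a 0/1 column
fit5 <- lm(glucose ~ physactive, data = sim)
head(model.matrix(fit5))

# Intercept = mean of reference group; other coefficients = differences
group_means <- tapply(sim$glucose, sim$physactive, mean)
same_means <- all.equal(
  unname(coef(fit5)),
  unname(c(group_means[1], group_means[-1] - group_means[1]))
)

c(same_t = same_t, same_p = same_p, same_means = same_means)
```

Note that `var.equal = TRUE` is needed for the match: by default `t.test()` runs the Welch test, which does not assume equal variances and so is not the same model as `lm()`.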
Software turns categories of *physactive* into four 0/1 variables (plus an intercept column of 1s)

<table>
 <thead>
  <tr> <th style="text-align:left;"> physactive </th> <th style="text-align:right;"> Intercept </th> <th style="text-align:right;"> less </th> <th style="text-align:right;"> equal </th> <th style="text-align:right;"> more </th> <th style="text-align:right;"> much_more </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> much_more </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr>
  <tr> <td style="text-align:left;"> much_less </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> much_less </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> less </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> equal </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> </tr>
  <tr> <td style="text-align:left;"> much_more </td> <td style="text-align:right;"> 1 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 0 </td> <td style="text-align:right;"> 1 </td> </tr>
</tbody>
</table>

---

# What's missing? ANOVA!

```
##                     Estimate Std. Error t value
## (Intercept)           98.421      0.934 105.390
## physactiveless        -0.858      1.078  -0.796
## physactiveequal       -1.214      1.005  -1.208
## physactivemore        -2.360      1.005  -2.348
## physactivemuch_more   -3.278      1.115  -2.941
```

*Fitted model*

`$$\small \mathrm{glucose} = 98.42 - 0.86 \, X_\mathrm{less} - 1.21 \, X_\mathrm{eq} - 2.36 \, X_\mathrm{more} - 3.28 \, X_\mathrm{much\_more}$$`

---

# What's missing? ANOVA!

*Fitted model*

`$$\small \mathrm{glucose} = 98.42 - 0.86 \, X_\mathrm{less} - 1.21 \, X_\mathrm{eq} - 2.36 \, X_\mathrm{more} - 3.28 \, X_\mathrm{much\_more}$$`

<table>
 <thead>
  <tr> <th style="text-align:left;"> Activity </th> <th style="text-align:left;"> Mean calc. </th> <th style="text-align:right;"> Fitted mean </th> </tr>
 </thead>
<tbody>
  <tr> <td style="text-align:left;"> much_less </td> <td style="text-align:left;"> 98.42 </td> <td style="text-align:right;"> 98.4 </td> </tr>
  <tr> <td style="text-align:left;"> less </td> <td style="text-align:left;"> 98.42 - 0.86 </td> <td style="text-align:right;"> 97.6 </td> </tr>
  <tr> <td style="text-align:left;"> equal </td> <td style="text-align:left;"> 98.42 - 1.21 </td> <td style="text-align:right;"> 97.2 </td> </tr>
  <tr> <td style="text-align:left;"> more </td> <td style="text-align:left;"> 98.42 - 2.36 </td> <td style="text-align:right;"> 96.1 </td> </tr>
  <tr> <td style="text-align:left;"> much_more </td> <td style="text-align:left;"> 98.42 - 3.28 </td> <td style="text-align:right;"> 95.1 </td> </tr>
</tbody>
</table>

---

# Linear models

**They're all the same type of model - a linear model!**

<img src="data:image/png;base64,#t-tests_and_regression_xar_files/figure-html/unnamed-chunk-32-1.png" style="display: block; margin: auto;" />

---

# References

Hulley, S. et al., 1998, Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women, *Journal of the American Medical Association* 280(7), 605-613

Vittinghoff, E., 2012, *Regression Methods in Biostatistics*, 2nd ed., Springer (https://regression.ucsf.edu)